Efficient  Synchronization  Algorithms  Using  Fetch  &  Add 
on  Multiple  Bitfield  Integers 

by 

Eric  Freudenthal 

Olivier  Peze 

Ultracomputer  Note  #148 
February,  1988 


This  work  was  carried  out  as  part  of  the  research  program  on  parallel  processing  at  the  Ultracomputer 
Research  Laboratory  of  the  Courant  Institute  of  Mathematical  Sciences.  It  was  supported  under  a  Joint  Study 
Agreement  with  IBM,  grant  number  IBM-N0039-84-R-0605. 

Current  Address  for  Olivier  Peze:  Ecole  Nationale  Superieure  des  Telecommunications  de  Paris,  France. 


ABSTRACT 

Efficient interprocess synchronization tools for MIMD shared memory
computers are critically important for the success of such systems. As members
of the Ultracomputer Project at NYU, the authors have been participating in
the continuing development of such algorithms. From their work, partially
represented in this paper, they have observed that many of the more
efficient algorithms utilize the storage and manipulation of more than one state
variable in a single machine integer using fetch-and-add (see [GLR 83]).
This paper describes this technique and presents several new
implementations of busy-waiting synchronization functions which utilize this facility.

1.  Introduction 

The  NYU  Ultracomputer  is  a  shared  memory  MIMD  computer  (see  [Gottlieb  86]). 
Each  processor  has  a  local  cache  for  private  data  and  code,  and  is  provided  access  to  an 
interleaved  shared  memory  via  a  logarithmic  multistage  interconnection  network.  To  avoid 
contention in the memory interface network, the Ultracomputer combines concurrent load,
store, and fetch-and-add* requests to the same memory location.

Unfortunately, due to the requisite network traversal, shared memory MIMD
computers such as the NYU Ultracomputer and the IBM RP3 suffer from high latency on references
to shared memory. The algorithms presented in this paper store several variables' worth of
state information in a single machine word. These state variables can be read and modified
by a single network reference (faa), providing significant time savings over other algorithms
which need several accesses to determine state information and sometimes require
additional locking mechanisms. In addition, the ability to fetch and modify several state
variables atomically is very convenient in the development of synchronization algorithms.

Harrison (see [Harrison 86, 88]) has proposed the use of combining positive bitfield
increments to special add-and-λ processors which would have been substituted for memory.
These add-and-λ processors would directly support synchronization primitives. The value
returned by an add-and-λ operation would indicate the state of some lock rather than the
actual value of a memory cell modified by fetch-and-add. In contrast, the algorithms
presented in this paper do not utilize any special hardware except fetch-and-add.

2.  Fetch-and-add  on  bitfield  variables 

Bitfield  variable  storage,  sometimes  called  packing,  is  a  method  of  storing  more  than 
one  variable  in  a  single  machine  word.  This  mechanism  is  often  used  in  serial  programs  to 
conserve  space.  The  algorithms  presented  in  this  paper  will  be  using  an  extension  of  this 
technique  which  supports  parallel  access  to  such  variables. 


The  views  and  opinions  expressed  herein  do  not  necessarily  reflect  those  of  IBM. 

* Fetch-and-add (faa) is a hardware primitive which allows any number of processors to read (fetch) the
value of an integer and add an increment to it atomically. For example, the instruction faa(&i, 3) increments
the value of i by three, and returns the value of i previous to the increment. On the NYU Ultracomputer and the
IBM RP3, concurrent fetch-and-add requests to the same memory address are combined in the
processor-to-memory network, yielding results which are equivalent to some serialization without serialization
delays (see [Gottlieb 86]).


Current machines support 32 bit words, and macros (to be described below) hide the bit
packing order. However, without loss of generality, for the following discussion we shall
assume that a machine word is eight bits wide, and that a C compiler allocates bitfield
variables (see [K&R 78]) most significant bit first.

For example, the following code, in the language C, defines the variable x, which
represents a single eight bit machine word.

struct {                /* assume compiler assigns msb first */
    unsigned b:4;
    unsigned a:4;
} x;

which is represented as follows:

Fields a and b in an Eight Bit Machine Word

field name:       b  b  b  b  a  a  a  a
bit location:     7  6  5  4  3  2  1  0
b = 3; a = 2:     0  0  1  1  0  0  1  0

The least significant four bits, from bit zero to three, are used to store the integer field
x.a. The next four bits, from bit four to seven, are used to store the other four bit integer
field x.b. Of course, these variables are of restricted range: they can not become negative,
nor can they have a value which requires more bits to represent than have been allocated.
Both x.a and x.b are four bit wide fields, which restricts their values to the range of zero to
15 (which is 2^4 - 1).

Serial  programs  reference  these  fields  by  shifting  and  masking.  Unfortunately,  such 
references  must  occur  serially,  making  them  unusable  for  shared  access  in  a  parallel  shared- 
memory  MIMD  environment. 

Bitfield variables can be modified by addition. Observe that since the field x.a is
stored in the least significant bits of the machine word, fetch-and-add (faa) can directly
modify this field. Modifying the other field of x is a bit more tricky. To add delta to x.b,
delta must be multiplied by sixteen (which is equal to 2^4). For example, the operation
faa(&x, 3*16) will add three to the field x.b.

Fortunately, fetch-and-add accesses to bitfield variables can be combined. For
example, the instruction faa(&x, delta) increments the x.a field by delta as long as the value of the
x.a field stays within its restricted range. If this restriction were not followed, the resulting
carry or borrow would disturb the value of the other field in the same word of storage. In
addition, a snapshot of the entire variable (including both of its fields) before the increment is
returned by the fetch-and-add.

A fundamental feature of fetch-and-add on bitfield variables is that more than one field
can be modified by a single addition if they are all stored in the same machine word. For
example, the instruction faa(&x, i + j*16) increments x.a by i and x.b by j atomically, as
well as returning the value of x prior to this pair of increments.
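The eight bit example can be sketched in portable C as follows. This is only an illustration: the C11 function atomic_fetch_add stands in for the hardware faa (it does not model network combining), fields are extracted by explicit shifts and masks rather than by compiler bitfields, and the names field_a, field_b and faa_ab are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed layout from the text: a in bits 0..3, b in bits 4..7.
 * Only the low eight bits of the word are used. */
static inline unsigned field_a(unsigned w) { return w & 0xFu; }
static inline unsigned field_b(unsigned w) { return (w >> 4) & 0xFu; }

/* Stand-in for the hardware primitive: returns the value before the add. */
static inline unsigned faa(atomic_uint *p, int delta)
{
    return atomic_fetch_add(p, (unsigned)delta);
}

/* Adds i to field a and j to field b in one atomic operation and
 * returns the snapshot of the whole word taken before the add. */
unsigned faa_ab(atomic_uint *x, int i, int j)
{
    unsigned old = faa(x, i + j * 16);
    /* Both fields must stay inside 0..15; otherwise a carry or borrow
     * would have leaked into the neighbouring field. */
    assert(field_a(old) + i <= 15u && field_b(old) + j <= 15u);
    return old;
}
```

Starting from b = 3 and a = 2, a call faa_ab(&x, 1, 2) returns the old word (binary 0011 0010) and leaves b = 5 and a = 3.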




Both for efficiency and portability, macros have been implemented for the NYU
Ultracomputer parallel C language so that bitfield variables can be conveniently referenced by
fetch-and-add. The general form of these macros is faaN(var, field_1, delta_1 [, ..., field_n, delta_n]). For
example, the macro faa2(x, a, i, b, j) is equivalent to faa(&x, i + j*16) as explained in the
previous paragraph.
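The macros themselves are not listed in this note. A minimal sketch of the idea, assuming each field is described by its bit offset, is to sum the shifted increments and issue one fetch-and-add. The type field_delta_t and the functions pack_deltas and faa_n are hypothetical, and atomic_fetch_add again stands in for faa.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical description of one field update: bit offset and increment. */
typedef struct {
    unsigned shift;
    int delta;
} field_delta_t;

/* Combines several per-field increments into a single addend.
 * A negative delta becomes a two's complement value, so adding it
 * modulo 2^32 subtracts from the field at that offset. */
uint32_t pack_deltas(const field_delta_t *fd, int n)
{
    uint32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (uint32_t)fd[i].delta << fd[i].shift;
    return sum;
}

/* Applies all increments with one atomic add; returns the prior word. */
uint32_t faa_n(_Atomic uint32_t *word, const field_delta_t *fd, int n)
{
    return atomic_fetch_add(word, pack_deltas(fd, n));
}
```

With the eight bit layout used earlier (a at offset 0, b at offset 4), incrementing a by one and decrementing b by one turns the word 0x32 into 0x23 in a single operation.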

3.   Reader-reader  lock 

A reader-reader lock is a mutual exclusion algorithm where processes are members of
one of two groups, A and B (see [Edler 84]). There is no limit to the number of processes
(up to some predetermined maximum) in either group which can enter the critical section at
one time. However, only members of one of the groups can enter the critical section at a
time. In many algorithms, a reader-reader lock can replace a binary semaphore, allowing a
higher degree of parallelism. Several applications will be discussed at the end of this paper.
The original reader-reader lock was developed by Jan Edler at NYU and is used in the
operating system and user-mode support code for the NYU Ultracomputer. An algorithm
similar to one presented in this paper was developed concurrently by Jerome Chiabaut at
NYU.

3.1.  The  Algorithm 

The implementation described below assigns the A lock higher priority than the B lock.
Although this approach requires very few network accesses, it can lead to starvation of the
lower priority lock. The end of this section mentions some variations on this algorithm, one
of which does not have this problem.

3.1.1.  Variable 

typedef  struct  {  /*  32  bit  integer  storage  */ 

unsigned  aCount:16; 

unsigned  bCount:16; 
}  rrLockT; 
rrLockT  abSem; 

A long shared integer, abSem, is used to store all values. It holds two bitfield
values, abSem.aCount and abSem.bCount: abSem.aCount holds the count of posted (possibly
granted) A locks, and abSem.bCount holds the maximum count of posted B locks. Both the
bCount and aCount fields are initialized to zero, indicating that neither lock is granted.
Naturally, the number of processes which can request this lock must be restricted to the
maximum value which can be stored in either field (in this case 65535).

3.1.2.  Lower  Priority  Lock  Implementation 

/*
 * lock 'B' without priority over lock 'A'
 */
void rrLockB(lock)
rrLockT *lock;
{
    rrLockT t;
top:
    t = faa1(lock, bCount, 1);          /* posting -- increment bCount */
    if (t.aCount != 0) {                /* can't grant if 'A' lock granted or posted */
        t = faa1(lock, bCount, -1);     /* decrement bCount */
        while (t.aCount != 0)           /* wait until no 'A' locks posted */
            t = *lock;
        goto top;                       /* try again */
    }
    /* (granted) */
}

When a lock is requested, the requesting process atomically increments the lock's count
(posting) while checking that the other count is still zero. If so, then the lock is granted.
If the other count was non-zero at the time that its own count was incremented, the process
decrements its count (to indicate that it does not hold a lock), and then waits until the
other count's value is zero due to the release of the other lock. When this occurs, the
waiting process restarts the requesting procedure. If there is no contention for the lock,
this routine requires one shared memory access.
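For readers without the Ultracomputer macros, the same request can be sketched with C11 atomics. The sketch assumes aCount occupies bits 0..15 and bCount bits 16..31 of a 32 bit word (the real packing order is compiler dependent), uses atomic_fetch_add in place of faa, and the names rr_lock_t, a_count, b_count and rr_lock_b are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: aCount in bits 0..15, bCount in bits 16..31. */
typedef _Atomic uint32_t rr_lock_t;
#define B_ONE (UINT32_C(1) << 16)

static inline uint32_t a_count(uint32_t w) { return w & 0xFFFFu; }
static inline uint32_t b_count(uint32_t w) { return w >> 16; }

/* Lower priority request: post, back off if any A is posted, retry. */
void rr_lock_b(rr_lock_t *lock)
{
    for (;;) {
        uint32_t t = atomic_fetch_add(lock, B_ONE);   /* post: bCount += 1 */
        if (a_count(t) == 0)
            return;                                   /* granted */
        atomic_fetch_sub(lock, B_ONE);                /* withdraw the post */
        while (a_count(atomic_load(lock)) != 0)
            ;                                         /* busy-wait for A to clear */
    }
}
```

In the uncontended case the function performs exactly one atomic add, matching the single shared memory access counted in the text.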

3.1.3.  Higher  Priority  Lock  Request  Implementation 

/*
 * lock 'A' with priority -- locks out further 'B' requests until released
 */
rrPriLockA(lock)
rrLockT *lock;
{
    rrLockT t;

    t = faa1(lock, aCount, 1);          /* post by incrementing aCount */
    while (t.bCount != 0)               /* wait for 'B's to exit */
        t = *lock;
    /* (granted) */
}

When a priority lock is requested, the requesting process first posts its request by
incrementing its count and then waits until all holders of the other lock have released it
before considering the lock granted. By waiting without decrementing its count, this request
prevents further requests for the other lock from succeeding. If there is no contention for
the lock, this function requires one network access.
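A corresponding sketch of the priority request, under the same assumptions as before (aCount in bits 0..15, bCount in bits 16..31, atomic_fetch_add standing in for faa, hypothetical names):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: aCount in bits 0..15, bCount in bits 16..31. */
typedef _Atomic uint32_t rr_lock_t;
#define A_ONE UINT32_C(1)

static inline uint32_t a_count(uint32_t w) { return w & 0xFFFFu; }
static inline uint32_t b_count(uint32_t w) { return w >> 16; }

/* Higher priority request: post once and keep the post while waiting,
 * so that later B requests see a non-zero aCount and back off. */
void rr_pri_lock_a(rr_lock_t *lock)
{
    uint32_t t = atomic_fetch_add(lock, A_ONE);   /* post: aCount += 1 */
    while (b_count(t) != 0)                       /* wait for B holders to exit */
        t = atomic_load(lock);
}
```

The post is never withdrawn, which is what gives the A lock its priority.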

3.1.4.  Non-waiting (lower priority) lock request implementation

/*
 * request lock 'B' without priority
 * returns 1 (true) if succeeded, 0 otherwise
 */
rrReqB(lock)
rrLockT *lock;
{
    rrLockT t;

    if (lock->aCount != 0)              /* don't increment if unavailable */
        return 0;                       /* not available, just return 'failed' */
    else {
        t = faa1(lock, bCount, 1);      /* only post once */
        if (t.aCount != 0) {            /* if unavailable... */
            faa1(lock, bCount, -1);     /* ... just decrement */
            return 0;                   /* return 'failed' */
        } else                          /* lock acquired */
            return 1;                   /* return 'succeeded' */
    }
}


This  routine  requests  a  lock,  but  doesn't  wait  if  there  is  contention.  It  is  useful  in 
applications  where  there  is  other  work  which  can  be  performed  if  the  lock  is  unavailable.  If 
there  is  no  contention  for  the  lock,  this  function  requires  two  network  accesses. 
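The non-waiting request can be sketched in the same style. The layout (aCount in bits 0..15, bCount in bits 16..31) and all names are assumptions made for illustration, and atomic_fetch_add stands in for faa.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: aCount in bits 0..15, bCount in bits 16..31. */
typedef _Atomic uint32_t rr_lock_t;
#define A_ONE UINT32_C(1)
#define B_ONE (UINT32_C(1) << 16)

static inline uint32_t a_count(uint32_t w) { return w & 0xFFFFu; }
static inline uint32_t b_count(uint32_t w) { return w >> 16; }

/* Returns 1 if lock B was acquired, 0 if it was unavailable. */
int rr_req_b(rr_lock_t *lock)
{
    if (a_count(atomic_load(lock)) != 0)          /* cheap test before posting */
        return 0;
    uint32_t t = atomic_fetch_add(lock, B_ONE);   /* post exactly once */
    if (a_count(t) != 0) {                        /* an A slipped in first */
        atomic_fetch_sub(lock, B_ONE);            /* withdraw and give up */
        return 0;
    }
    return 1;
}
```

The initial load is the first of the two accesses counted in the text; it avoids disturbing the counts when an A lock is already posted.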

3.1.5.  Releasing  of  locks 

When  a  lock  is  released,  its  corresponding  count  is  decremented.  This  function  requires 
one  network  access. 
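No listing is given for the release functions. Based only on the description above, a release might look like the following sketch, in which the layout (aCount in bits 0..15, bCount in bits 16..31) and the names rr_unlock_a and rr_unlock_b are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: aCount in bits 0..15, bCount in bits 16..31. */
typedef _Atomic uint32_t rr_lock_t;
#define A_ONE UINT32_C(1)
#define B_ONE (UINT32_C(1) << 16)

static inline uint32_t a_count(uint32_t w) { return w & 0xFFFFu; }
static inline uint32_t b_count(uint32_t w) { return w >> 16; }

/* Release an A lock: one atomic decrement of aCount. */
void rr_unlock_a(rr_lock_t *lock)
{
    uint32_t t = atomic_fetch_sub(lock, A_ONE);
    assert(a_count(t) > 0);      /* a count is never decremented below zero */
}

/* Release a B lock: one atomic decrement of bCount. */
void rr_unlock_b(rr_lock_t *lock)
{
    uint32_t t = atomic_fetch_sub(lock, B_ONE);
    assert(b_count(t) > 0);
}
```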

3.1.6.  Changing  active  lock  types 

/*
 * equiv to unlockB(), priLockA()... but atomic
 * saves 1 shared access
 */
rrChangeBtoA(lock)
rrLockT *lock;
{
    rrLockT t;

    t = faa2(lock, aCount, +1, bCount, -1); /* remove myself from b lock count */
                                            /* and add myself to a lock count */
    if (t.bCount != 1)                      /* does anybody else hold a 'b' lock? */
        while (lock->bCount)                /* wait until all count b exit */
            ;
}

A process which already holds a lock can lock out further requests for the same lock, so
that it can obtain the other lock, by atomically incrementing the count for the other lock and
decrementing its own count. When all the locks from its own group are released, the lock for the
other group is granted. This mechanism is used by a fair version of the reader-reader lock
which guarantees that neither lock type is starved.

If priority locking is being utilized, a lower priority lock can be changed to a higher
priority lock, but not the reverse; otherwise deadlock could result. If both locks are
requested by the symmetrical (non-priority) algorithm, then either active lock type can
be changed. If there are no outstanding lower priority locks, this function requires one
network access.
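The combined update is a direct use of the two-field addition from section 2. A sketch with C11 atomics, assuming aCount in bits 0..15 and bCount in bits 16..31 and using hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: aCount in bits 0..15, bCount in bits 16..31. */
typedef _Atomic uint32_t rr_lock_t;
#define A_ONE UINT32_C(1)
#define B_ONE (UINT32_C(1) << 16)

static inline uint32_t a_count(uint32_t w) { return w & 0xFFFFu; }
static inline uint32_t b_count(uint32_t w) { return w >> 16; }

/* Caller must hold a B lock.  One atomic add performs aCount += 1 and
 * bCount -= 1; the addend wraps modulo 2^32, which is harmless because
 * bCount is at least one and aCount is below its maximum. */
void rr_change_b_to_a(rr_lock_t *lock)
{
    uint32_t t = atomic_fetch_add(lock, A_ONE - B_ONE);
    if (b_count(t) != 1)                            /* other B holders remain */
        while (b_count(atomic_load(lock)) != 0)
            ;                                       /* wait for them to exit */
}
```

When the caller was the only B holder, the snapshot shows bCount equal to one and no further memory reference is needed.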




3.2.  The  priority  reader-reader  lock  is  correct. 

The  priority  algorithm  utilizes  lower  priority  requests  for  one  lock  type,  and  higher 
priority  requests  for  the  other. 

For  this  algorithm  to  be  correct,  it  must  both  ensure  mutual  exclusion  between  the  two 
locks,  and  it  must  neither  deadlock  nor  livelock,  which  would  prevent  either  lock  from  being 
granted. 

•  Neither count can become negative.

The counts initially are set to zero. The count for either lock is only modified by
processes requesting or releasing it. They are always incremented before they are
decremented. The number of decrements is always less than or equal to the number of
increments.

•  If a process holds a lock, its count is greater than zero.

By above, no process will decrement a lock count without first incrementing it. If a
lock is granted, the call which requested it incremented the count one more time than it
decremented the count. Therefore the value of the count must be greater than zero.

•  Both locks can not be granted at once.

The requested lock's count is incremented and the other lock's count is tested in an
atomic faa. By the serialization principle of faa (see [GLR 83]), all such requests,
regardless of lock type, must yield the same results as some serialization. Therefore, if
a lock is already granted by one such operation, subsequent increment-test (faa)
operations for the other lock will see the non-zero count, and not grant the lock.

•  The priority algorithm can not lock up indefinitely.

Livelock  is  a  scenario  equivalent  to  the  degenerate  naive-p  (see  [GLR  83])  where,  due 
to  some  non-deterministic  timing,  no  process  can  acquire  a  lock  for  a  prolonged 
(perhaps  infinite)  period  of  time.  Livelock  differs  from  deadlock  in  that  it  is  possible, 
in  some  serialization,  for  a  process  to  be  granted  a  lock  in  a  livelock  condition,  but  not 
in  a  deadlock  condition. 

We  shall  assume  that  there  are  a  bounded  number  of  processes  (less  than  the  maximum 
described  previously)  which  are  requesting  locks,  that  no  process  ever  attempts  to  hold 
two  locks  at  the  same  time,  and  that  every  granted  lock  will  be  released  in  finite  time. 
What  needs  to  be  shown  is  that  independent  of  the  timing  of  lock  requests  a  lock  must 
always  be  granted  in  a  finite  amount  of  time. 

It  is  trivial  to  show  that  if  only  one  type  of  lock  is  requested,  it  will  be  granted.  Since 
only  the  count  corresponding  to  the  lock  type  which  is  requested  is  incremented,  the 
count  for  the  other  lock  will  eventually  become  zero  when  all  (if  any)  locks  of  the  other 
type  are  released. 

It is a little less obvious why it is impossible for this algorithm not to grant a lock of
either type for an infinite amount of time. This can only occur if there are outstanding
lock requests for both types of locks. In this case, the higher priority lock type will be
granted to all outstanding requests for that type. This is because, after some finite
amount of time, all processes requesting the lower priority lock, having detected the
contention indicated by the non-zero other count, will have decremented themselves out
of their own count, leaving it zero. At this time, the higher priority lock requests will
be granted. If a higher degree of contention is expected, the lower priority lock
request algorithm can be changed so that it waits for the higher priority lock's count to
become zero before it performs the first decrement, reducing the likelihood of a
prolonged delay before the higher priority lock is granted.

3.2.1.   Symmetrical  (no  priority)  algorithm  can  livelock. 

It is necessary (although not obvious) that one of the lock types have priority over the
other. Otherwise, if the higher priority lock code is modified so that it is symmetrical to
the lower priority lock, there is a danger of livelock. This is because it is possible for the
counts to be non-zero at a time that no locks are granted. When there are at least two
concurrent unsatisfied lock requests for each lock type, the degenerate condition can occur when
one of the counts is zero, the other non-zero, and no locks are granted.

A process requesting the lock corresponding to the non-zero count (assume A) will
escape the while loop, and attempt to increment its variable. Before it increments its
variable, the other count (B) is incremented (to non-zero) by a process which previously passed
the initial while loop's test. The increment-test for the A requests and the B requests both
fail. When either of them decrements its count to zero, additional requests for the other
lock are allowed to exit the while loop and start their own increment-test/decrement posting
sequence. In this manner, requests for each lock can exit the while loop when no locks are
granted, even though the counts are non-zero.


State   Action                        A Count   B Count   Holds Lock

  1     A1, A2, B2 pass while test       0         0       nobody
  2     A1++ succeeds; granted           A1        0       A1
  3     B2++; fails                      A1        B2      A1
  4     A1-- (release)                   0         B2      nobody
  5     A1 exits                         0         B2      nobody
  6     B1 passes while test             0         B2      nobody
  7     A2++; fails                      A2        B2      nobody
  8     B2--; loops                      A2        0       nobody
  9     A1 passes while test             A2        0       nobody
 10     B1++; fails                      A2        B1      nobody
 11     A2--; loops                      0         B1      nobody
 12     B2 passes while test             0         B1      nobody
 13     A1++; fails                      A1        B1      nobody
 14     B1--; loops                      A1        0       nobody
 15     A2 passes while test             A1        0       nobody
 16     B2++; fails                      A1        B2      nobody
 17     A1--; loops                      0         B2      nobody
  6     B1 passes while test             0         B2      nobody

3.3.   Other  flavors  of  related  reader-reader  locks 

A symmetrical algorithm, without priority, can be realized by adding a pair of boolean
fields, one to indicate if each lock is granted, into the bitfield integer which indicates if any
process actually holds either lock. The initial test is modified so that requests for a
particular lock will be posted iff both the other count is non-zero (as before) and if the count for
locks of the same type is non-zero or a lock of the same type is granted (as indicated by the
corresponding boolean flag).

An asymmetric algorithm which guarantees that neither lock type can be starved by a
continuous stream of either type of lock request can also be realized by extending the
priority lock. This fair algorithm counts unsatisfied requests, and forces the lock to regularly
change phase if there are unsatisfied requests.

4.   Reader-writer  lock 

A readers-writers lock ([GLR 83]) is a synchronization algorithm which protects a
critical section of code. Processes are broken into two groups, readers and writers. Any number
of reader processes are allowed to hold the lock concurrently, but only one writer process is
allowed to hold a lock at a time. Of course, reader and writer locks are mutually exclusive.
A readers-writers lock is simply a priority reader-reader lock which only allows a single
process from the higher priority group to enter the critical section. This algorithm is equivalent
to the one presented in [GLR 83]; however, that algorithm did not formalize the notion of
bitfield variables and their general usefulness.

typedef rrLockT rwLockT;

#define rwLockReader(lock)      rrLockB(lock)
#define rwUnlockReader(lock)    rrUnlockB(lock)
#define rwUnlockWriter(lock)    rrUnlockA(lock)

rwLockWriter(lock)
rwLockT *lock;
{
    abCountT phase;
top:
    phase = faa1(&lock->abCount, aCount, 1);        /* post by incrementing aCount */
    if (phase.aCount != 0) {                        /* are we first? */
        phase = faa1(&lock->abCount, aCount, -1);   /* no; decrement, wait till 0 */
        for (phase.aCount -= 1; phase.aCount != 0; phase = lock->abCount)
            ;
        goto top;
    } else                                          /* we are first */
        while (phase.bCount != 0)
            phase = lock->abCount;                  /* wait for Bs to exit */
}
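The writer request can be restated with C11 atomics as a rough sketch. It assumes writers use aCount (bits 0..15) and readers use bCount (bits 16..31), uses atomic_fetch_add in place of faa, and all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: aCount (writers) in bits 0..15, bCount (readers) in 16..31. */
typedef _Atomic uint32_t rw_lock_t;
#define A_ONE UINT32_C(1)

static inline uint32_t a_count(uint32_t w) { return w & 0xFFFFu; }
static inline uint32_t b_count(uint32_t w) { return w >> 16; }

/* Writer request: at most one writer may stay posted at a time. */
void rw_lock_writer(rw_lock_t *lock)
{
    for (;;) {
        uint32_t t = atomic_fetch_add(lock, A_ONE);   /* post as writer */
        if (a_count(t) == 0) {                        /* we are the first writer */
            while (b_count(t) != 0)
                t = atomic_load(lock);                /* wait for readers to exit */
            return;
        }
        atomic_fetch_sub(lock, A_ONE);                /* another writer is posted */
        while (a_count(atomic_load(lock)) != 0)
            ;                                         /* wait, then try again */
    }
}
```

An uncontended writer needs a single atomic add, after which aCount is one and any reader request backs off.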


5.   Wave  and  Phase  Synchronization 

Many algorithms are synchronous in nature, having sections which need to work in
lock-step with varying granularity. In such algorithms, a critical section extends through
several synchronization points where all processes in the critical section must arrive before
they can continue.

5.1.   Barrier  synchronization 

Barrier synchronization provides a mechanism to synchronize a group of processes
which need to step through a problem together. When a program calls barrier() it is delayed
until all of the processes with which it is synchronizing also call barrier.
The data structure is a bitfield integer with two fields:


typedef struct {
    unsigned P:15;      /* # processes */
    unsigned B:17;      /* barrier count, see range() */
} barrierT;

which is stored as follows:

Field Allocation for 32 Bit Barrier

Name            P                              B
Usage           number of processes            barrier count
Useful range    maximum number of processes    0..3*P
Field width     15 bits                        17 bits
Bit locations   31..17                         16..0

The  P  field  is  initialized  to  the  number  of  processes  which  will  be  synchronizing,  and 
the  B  field  is  initialized  to  zero. 

/*
 * range returns true if B is in range P..2*P-1
 * normally coded as a macro; shown as a function for clarity
 */
range(b)
barrierT b;
{
    return (b.B >= b.P && b.B < b.P * 2);
}

/*
 * barrier() wait until 'P' processes call 'barrier',
 * based on Dimitrovsky barrier algorithm
 * resets for next barrier automatically
 */
barrier(bar)
barrierT *bar;
{
    barrierT t;
    int oldRange;

    t = faa1(bar, B, 1);                    /* register myself at barrier */
    oldRange = range(t);                    /* note the range before I registered */
    t.B = t.B + 1;                          /* adjust B to value after I registered */
    if (oldRange != range(t)) {             /* am I last to register? */
        if (oldRange == 1)                  /* did I increment from phase 1->0 */
            faa1(bar, B, -2 * t.P);         /* yes, reset to low of range 0 */
    } else                                  /* I was not last to register */
        while (oldRange == range(*bar))     /* wait for others to register */
            ;
}

The algorithm works by incrementing the B field through two ranges: range 0 includes
values from 0 to P-1 and range 1 includes values from P to 2*P-1. Values greater than or
equal to 2*P are considered as part of range 0. At each sync point, each process calls
barrier, which uses fetch-and-add to increment B by one, noting the range which B was
previously in. Then all processes wait until B is incremented to the next range. The process
whose increment causes the transition from range 1 to 0 (incrementing from 2*P-1 to 2*P)
will decrement B by 2*P.
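The range logic can be exercised without real parallelism by separating the registration step from the wait. The sketch below assumes the field layout of the table (P in bits 31..17, B in bits 16..0), uses atomic_fetch_add in place of faa, and introduces the hypothetical names barrier_init, barrier_arrive and barrier_wait.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout from the table: B in bits 0..16, P in bits 17..31. */
#define B_BITS 17u
#define B_MASK ((UINT32_C(1) << B_BITS) - 1u)
typedef _Atomic uint32_t barrier_t;

static inline uint32_t bar_p(uint32_t w) { return w >> B_BITS; }
static inline uint32_t bar_b(uint32_t w) { return w & B_MASK; }

/* 1 if b lies in range 1 (P..2P-1); values >= 2P count as range 0. */
static inline int bar_range(uint32_t p, uint32_t b) { return b >= p && b < 2u * p; }

void barrier_init(barrier_t *bar, uint32_t nproc)
{
    assert(nproc > 0 && nproc < (UINT32_C(1) << 15));
    atomic_init(bar, nproc << B_BITS);            /* P = nproc, B = 0 */
}

/* Registers the caller.  Returns 1 if it was the last arrival;
 * *old_range receives the range observed before registering. */
int barrier_arrive(barrier_t *bar, int *old_range)
{
    uint32_t t = atomic_fetch_add(bar, 1u);       /* B += 1 */
    uint32_t p = bar_p(t), b = bar_b(t);
    *old_range = bar_range(p, b);
    if (*old_range == bar_range(p, b + 1u))
        return 0;                                 /* range unchanged: not last */
    if (*old_range == 1)
        atomic_fetch_sub(bar, 2u * p);            /* 2P reached: reset B */
    return 1;
}

/* Full barrier: register, then spin until the range changes. */
void barrier_wait(barrier_t *bar)
{
    int old_range;
    if (!barrier_arrive(bar, &old_range)) {
        uint32_t w;
        do {
            w = atomic_load(bar);
        } while (bar_range(bar_p(w), bar_b(w)) == old_range);
    }
}
```

With P = 3, the third arrival moves B from range 0 to range 1, and the sixth arrival takes B to 6 = 2*P and immediately subtracts 6, leaving B at zero and P untouched.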

Observe that the value of the B field is bounded by 3*P for the following reason: At
synchronization points where the B field started in range zero, all processes must wait until B is
incremented to the value 2*P. While one of these processes is decrementing B by the same
amount, the other processes may continue on to the next barrier point. Assume that all of
the other processes reach the next barrier point before B is decremented. If the slow process
is a member of the group of processes which will be synchronizing at the next barrier, only
P-1 processes are available to increment B, restricting the value of B to less than 3*P-1.
If this process is not a member of that group, then the limit is 3*P. Because of its range,
two more bits are allocated to B than P, causing the maximum number of processes which
can use this implementation (with 32 bit words) to be restricted to 2^15 - 1, which is 32767.
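The field width arithmetic can be checked mechanically. The helper below is a hypothetical illustration that tests whether a barrier count of up to 3*P fits in a B field of a given width when P uses its full width.

```c
#include <assert.h>
#include <stdint.h>

/* Largest process count a p_bits wide P field can hold. */
uint32_t max_processes(unsigned p_bits)
{
    return (UINT32_C(1) << p_bits) - 1u;
}

/* Returns 1 if a barrier count of up to 3*P fits in b_bits bits. */
int barrier_count_fits(unsigned p_bits, unsigned b_bits)
{
    assert(p_bits + b_bits <= 32 && b_bits < 32);
    uint64_t worst = 3u * (uint64_t)max_processes(p_bits);
    return worst < (UINT64_C(1) << b_bits);
}
```

For the 15/17 split, the worst case is 3 * 32767 = 98301, which is below 2^17 = 131072. One extra bit (15/16) is not enough, and neither is an even 16/16 split.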

The cost of a call to barrier() is at least one network access for one process in the group
(on the transition from range 0 to range 1), and at least two for the others. Dimitrovsky's
original implementation required two or three accesses because the number of processes was
kept in a separate variable. Admittedly, this savings is small, and since the value of P is
invariant over the critical section, it could be copied into private memory. This
implementation also differs from that presented in [Dimitrovsky 85] in that he did not consider values
of B greater than 2*P-1 to be in range 0; therefore all of the processes needed to wait until
after B was reset to zero, causing at least one extra delay due to network references for most
processes.

5.1.1.  Two  Phase  Lock 

A simple two phase barrier synchronization algorithm can be achieved utilizing the
priority reader-reader lock presented above. The first phase is protected by lock B, and the
second phase is protected by lock A. Since these two locks are exclusive, there is no risk of
processes being in both phases at the same time. Entry to the first phase is granted by a call
to lockB(); the transition to phase two is achieved by calling changeBtoA(), which locks out
further requests for the first phase and locks the second phase. The lock is released by
calling unlockA(). If the fair reader-reader lock mentioned above is used instead of the priority
lock, this lock will also have the wave property. A feature of this algorithm is that the
number of processes which are synchronizing together is determined automatically.


Implementation of Two Phase Lock

function            what it does                               how implemented   min cost in network refs
twoPhaseLock0       enter first phase                          rrLockB           1
twoPhaseLock0to1    enter second phase from first              rrChangeBtoA      1
twoPhaseUnlock1     exit second phase                          rrUnlockA         1
twoPhaseUnlock0     exit first phase without entering second   rrUnlockB         1
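The table can be read as a set of thin wrappers. The following self-contained sketch restates the needed reader-reader operations with C11 atomics; the bit layout (aCount in bits 0..15, bCount in bits 16..31) and every name in it are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: aCount in bits 0..15, bCount in bits 16..31. */
typedef _Atomic uint32_t rr_lock_t;
#define A_ONE UINT32_C(1)
#define B_ONE (UINT32_C(1) << 16)

static inline uint32_t a_count(uint32_t w) { return w & 0xFFFFu; }
static inline uint32_t b_count(uint32_t w) { return w >> 16; }

/* Phase 0 entry: lower priority B request. */
void two_phase_lock0(rr_lock_t *l)
{
    for (;;) {
        uint32_t t = atomic_fetch_add(l, B_ONE);
        if (a_count(t) == 0)
            return;
        atomic_fetch_sub(l, B_ONE);
        while (a_count(atomic_load(l)) != 0)
            ;
    }
}

/* Phase 0 to phase 1: swap a B post for an A post in one atomic add,
 * then wait until every phase 0 process has left or switched. */
void two_phase_lock0to1(rr_lock_t *l)
{
    uint32_t t = atomic_fetch_add(l, A_ONE - B_ONE);
    if (b_count(t) != 1)
        while (b_count(atomic_load(l)) != 0)
            ;
}

/* Exit from phase 1 and from phase 0 respectively. */
void two_phase_unlock1(rr_lock_t *l) { atomic_fetch_sub(l, A_ONE); }
void two_phase_unlock0(rr_lock_t *l) { atomic_fetch_sub(l, B_ONE); }
```

Each wrapper issues one atomic operation on its uncontended path, in line with the minimum costs listed in the table.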

5.2.   Wave  Synchronization 

Wave synchronization (see [Dimitrovsky 86]) breaks processes requesting entry to a
critical section into groups such that a process is allowed to enter the critical section no
more than one time for each group. This smaller group can synchronize at barrier points
inside the critical section. Furthermore, a process which is waiting for entry to such a
critical section need only wait until the current group releases all of its locks before it can enter
with the next group. Note that the two phase barrier algorithm has the wave property.

5.2.1.   Group  Lock 

Group lock is a barrier synchronization algorithm (see [Dimitrovsky 86]) which is
useful for algorithms where a group of processes need to synchronize at one or more barriers
and the size of the group is unknown. It also has the wave property. It consists of three
subroutines: gLock, gSync, and gUnlock. Typical usage is as follows:

gLock()
    ... code for phase 0
gSync()
    ... code for phase 1
gSync()
    ... code for phase n
gUnlock()

The effect of group lock in the above example is to make sure that processes are only
executing code in one phase at a time. Processes which call gLock() when other processes
hold active locks must wait until the lock is next available. This algorithm (as does the
original group-lock implementation) guarantees that a process which must wait for the lock to
be released will be granted the lock with the next group.

Several implementations of group lock using bitfield variables have been developed.
The following implementation utilizes only thirty-two bits of storage, limiting it to 2^10-1
processes. Less elegant implementations have been developed which require more storage
but allow a much larger number of processes to use the lock.

This  algorithm  requires  three  bitfields  as  follows: 




Field Allocation for 32 Bit Group Lock

Name            W                        N                                    B
Usage           # of waiting processes   # of processes in critical section   barrier count
Useful range    0..2^10-1                0..2^10-1                            0..3*N
Field width     10                       10                                   12
Bit locations   31..22                   21..12                               11..0
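
The field layout in the table above can be written down as a few C definitions. The following is a minimal sketch, assuming the synchronization variable is a 32-bit unsigned integer; the names B_ONE, N_ONE, W_ONE, field_w, field_n, field_b, and pack are illustrative and do not appear in the original note.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout from the table: W = bits 31..22, N = bits 21..12, B = bits 11..0.
   Every identifier here is a hypothetical name chosen for this sketch. */
enum { B_SHIFT = 0,  B_WIDTH = 12,
       N_SHIFT = 12, N_WIDTH = 10,
       W_SHIFT = 22, W_WIDTH = 10 };

static_assert(B_WIDTH + N_WIDTH + W_WIDTH == 32, "fields must fill a 32-bit word");

/* Adding one of these constants increments the corresponding field by one. */
#define B_ONE ((uint32_t)1 << B_SHIFT)
#define N_ONE ((uint32_t)1 << N_SHIFT)
#define W_ONE ((uint32_t)1 << W_SHIFT)

/* Extract one field from a snapshot of the synchronization variable. */
static uint32_t field(uint32_t v, int shift, int width)
{
    return (v >> shift) & (((uint32_t)1 << width) - 1);
}

static uint32_t field_w(uint32_t v) { return field(v, W_SHIFT, W_WIDTH); }
static uint32_t field_n(uint32_t v) { return field(v, N_SHIFT, N_WIDTH); }
static uint32_t field_b(uint32_t v) { return field(v, B_SHIFT, B_WIDTH); }

/* Build a word from three field values, each assumed to be within its range. */
static uint32_t pack(uint32_t w, uint32_t n, uint32_t b)
{
    return w * W_ONE + n * N_ONE + b * B_ONE;
}
```

Because each field has its own unit constant, a single addition of a sum or difference of these constants updates several fields at once, provided no field overflows or underflows into its neighbor.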

The lock algorithm is outlined in the flowchart below; operations which modify the
synchronization variable and must occur atomically using a single fetch-and-add are shown
inside boxes and comments are italicized.


[Flowchart: group lock entry algorithm]


The lock works as follows. Processes which would like to enter the critical section first
increment the waiting count W and test the atomically returned value of the barrier count B
to determine whether B was zero, indicating that this process can enter the lock with the
current wave. A non-zero value of B indicates that there are already processes
inside the lock, requiring the process to wait until the next group.
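
The registration step described above can be sketched in C. This is a minimal illustration rather than the authors' code: C11 atomic_fetch_add stands in for the Ultracomputer fetch-and-add, and the names glock_register and glock_wait_for_next_group are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define W_ONE  ((uint32_t)1 << 22)  /* adds 1 to W (bits 31..22) */
#define B_MASK ((uint32_t)0xFFF)    /* B occupies bits 11..0 */

static_assert((W_ONE & B_MASK) == 0, "W and B must not overlap");

/* Increment W and test the value of B returned by the same fetch-and-add.
   Returns true if B was zero (the caller may enter with the current wave),
   false if the caller must wait for the next group. */
static bool glock_register(_Atomic uint32_t *lock)
{
    uint32_t old = atomic_fetch_add(lock, W_ONE);
    return (old & B_MASK) == 0;
}

/* Busy-wait until the current group has released the lock (B == 0). */
static void glock_wait_for_next_group(_Atomic uint32_t *lock)
{
    while ((atomic_load(lock) & B_MASK) != 0)
        ;  /* spin */
}
```

A caller would invoke glock_wait_for_next_group only when glock_register returns false. Either way the process remains counted in W until it moves itself to N.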




Once a process has determined that it will enter the lock, the first step is to atomically
increment N and decrement W by one. Note that this operation does not affect the sum W + N.
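
The combined update of W and N fits in one fetch-and-add because the two changes can be folded into a single addend. The sketch below assumes the 32-bit layout given earlier (W in bits 31..22, N in bits 21..12); glock_move_w_to_n and the FIELD_ macros are hypothetical names, and C11 atomic_fetch_add stands in for the hardware primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define N_ONE ((uint32_t)1 << 12)   /* adds 1 to N (bits 21..12) */
#define W_ONE ((uint32_t)1 << 22)   /* adds 1 to W (bits 31..22) */

#define FIELD_B(v) ((v) & 0xFFFu)
#define FIELD_N(v) (((v) >> 12) & 0x3FFu)
#define FIELD_W(v) (((v) >> 22) & 0x3FFu)

static_assert(W_ONE > N_ONE, "W must sit above N in the word");

/* One atomic operation performs N += 1 and W -= 1.
   The addend N_ONE - W_ONE wraps modulo 2^32, which is well defined for
   unsigned arithmetic. The caller must already be counted in W, so W >= 1
   and the subtraction cannot borrow out of the word. */
static uint32_t glock_move_w_to_n(_Atomic uint32_t *lock)
{
    return atomic_fetch_add(lock, N_ONE - W_ONE);
}
```

With W = 2 and N = 1 before the call, the word holds W = 1 and N = 2 afterwards, so W + N is 3 in both states and B is untouched.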

In order to ensure that all processes which were required to wait until the previous
group released the lock are admitted with this group, all of the processes must wait until
at least one of them sees the value of W decremented to zero, indicating that there are no
processes at the entry waiting step.

Once a process (possibly several) notices that W has a zero value, it increments B, noting
the values of all three fields at the time of the increment. Note that the first of these
increments makes B non-zero, causing processes which subsequently try to enter (by
incrementing W) to wait until the next group is allowed to enter the first phase of the
critical section. The non-zero value of B also ensures that the processes still in the
registration wait will continue and increment B. The first process to increment B knows the
total number of processes in this group, which is W+N. Also observe that until this process
performs a second increment on B, B ≤ N.

All of the processes lock the wave by incrementing B; once they have all done so,
B = N = the actual number of processes in this wave. All of these processes except the first
now wait for B to take on a value greater than N. The first process notices that all of the
others have incremented B and performs the extra increment, allowing all of the processes to
enter the lock.
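
The bookkeeping around the increments of B can be sketched as follows. This is an illustration of the individual steps only, not a complete gLock(): the busy-wait loops are left out, and the names b_step, glock_increment_b, all_have_incremented, and wave_released are hypothetical. The layout assumed is W in bits 31..22, N in bits 21..12, and B in bits 11..0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define B_ONE ((uint32_t)1)         /* adds 1 to B (bits 11..0) */

#define FIELD_B(v) ((v) & 0xFFFu)
#define FIELD_N(v) (((v) >> 12) & 0x3FFu)
#define FIELD_W(v) (((v) >> 22) & 0x3FFu)

/* What one process learns from its increment of B. */
struct b_step {
    bool     first;       /* true if this increment made B non-zero */
    uint32_t group_size;  /* W + N at that moment; meaningful only if first */
};

/* Increment B and note all three fields as returned by the fetch-and-add. */
static struct b_step glock_increment_b(_Atomic uint32_t *lock)
{
    uint32_t old = atomic_fetch_add(lock, B_ONE);
    struct b_step r;
    r.first = (FIELD_B(old) == 0);
    r.group_size = FIELD_W(old) + FIELD_N(old);
    return r;
}

/* Condition the first process polls before performing its extra increment. */
static bool all_have_incremented(uint32_t snapshot, uint32_t group_size)
{
    return FIELD_B(snapshot) == group_size;
}

/* Condition every other process polls before entering the lock. */
static bool wave_released(uint32_t snapshot)
{
    return FIELD_B(snapshot) > FIELD_N(snapshot);
}
```

For a wave of three processes, B reaches 3 = N after the three ordinary increments, at which point all_have_incremented holds and wave_released does not; the first process's extra increment then makes B = 4 = N + 1, which satisfies wave_released.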

The function gBarrier() works analogously to the function barrier() described above
with the definition of the ranges modified as shown below.




[Figure: definition of ranges for Dimitrovsky's barrier and for the group lock barrier]

The unlock function works by decrementing B. The last process to decrement B (as
evidenced by B = 1 or B = N+1) performs an additional subtraction of both the value of B
and N. This function requires either one or two network accesses. A slightly more
expensive alternative unlock function can also be constructed which allows some processes
to leave the critical section between synchronization points.
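
One possible reading of the unlock step is sketched below. It assumes that the tests B = 1 and B = N+1 apply to the value of B after the decrement, which the text does not state explicitly, and it uses C11 atomic_fetch_sub in place of a fetch-and-add with a negative addend. The name gunlock_sketch and the FIELD_ macros are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define B_ONE ((uint32_t)1)         /* B occupies bits 11..0 */
#define N_ONE ((uint32_t)1 << 12)   /* N occupies bits 21..12 */

#define FIELD_B(v) ((v) & 0xFFFu)
#define FIELD_N(v) (((v) >> 12) & 0x3FFu)
#define FIELD_W(v) (((v) >> 22) & 0x3FFu)

/* Leave the critical section. The caller must hold the lock, so B >= 2
   and N >= 1 on entry (an assumption of this sketch). */
static void gunlock_sketch(_Atomic uint32_t *lock)
{
    uint32_t old   = atomic_fetch_sub(lock, B_ONE);  /* first network access */
    uint32_t b_now = FIELD_B(old) - 1;               /* B after our decrement */
    uint32_t n     = FIELD_N(old);

    if (b_now == 1 || b_now == n + 1) {
        /* Last process out: subtract the remaining value of B together
           with N in one operation, leaving both fields zero. */
        atomic_fetch_sub(lock, b_now * B_ONE + n * N_ONE);  /* second access */
    }
}
```

Under this reading every process pays one network access and only the last pays two, matching the cost stated above. The W field is never touched, so processes already waiting for the next group remain counted.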

This algorithm requires at least seven network accesses for a lock/barrier/unlock
sequence. This compares favorably with the minimum of twenty-seven required by the
non-bitfield implementation.

6.   Conclusion 

In our work developing environment support packages for the NYU Ultracomputer,
we realized the potential benefits of a tool able to atomically read and modify several state
variables. This led us to implement fetch-and-add on bitfield integers as presented in this
paper. The algorithms which we have developed using this technique have proven to be very
efficient. We are presently working on memory management algorithms which utilize these
synchronization tools.




7.   Acknowledgments 

We thank Allan Gottlieb, Jim Lipkis, Isaac Dimitrovsky, Wayne Berke, Brian Anderson,
and Lawrence Kaplan for their help, patience, and insights, both during the development
of the algorithms presented here and during the writing of this paper.


References 

[Dimitrovsky 85] Isaac Dimitrovsky, A Short Note on Barrier Synchronization,
Ultracomputer System Software Note #59, Courant Institute, NYU, New York, 1985.

[Dimitrovsky 86] Isaac Dimitrovsky, A Group Lock Algorithm, With Applications,
Ultracomputer Note #112, Courant Institute, NYU, New York, 1986.

[Edler 84] Jan Edler, Readers/Readers Synchronization, Ultracomputer System Software
Note #46, Courant Institute, NYU, New York, 1984.

[Gottlieb 86] Allan Gottlieb, An Overview of the NYU Ultracomputer Project,
Ultracomputer Note #100, Courant Institute, NYU, New York, 1986.

[GLR 83] Allan Gottlieb, Boris D. Lubachevsky, and Lawrence Rudolph, Basic Techniques
for Efficient Coordination of Very Large Numbers of Cooperating Processors, ACM
TOPLAS 5, pp. 164-189, April 1983.

[Harrison 86] Malcolm Harrison, The Add-and-Lambda Operation: An Extension of
F&A, Ultracomputer Note #104, Courant Institute, NYU, New York, 1986.

[Harrison 88] Malcolm Harrison, More Uses for Add and Lambda: Eliminating Busy
Waits, Ultracomputer Note #132, Courant Institute, NYU, New York, 1988.

[K&R 78] Brian Kernighan and Dennis M. Ritchie, The C Programming Language,
Prentice Hall, Inc., Englewood Cliffs, NJ, 1978.




